Skip to content

BUG: read_excel for ods files raising UnboundLocalError in certain cases #36175

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 13, 2020

Conversation

asishm
Copy link
Contributor

@asishm asishm commented Sep 6, 2020

Need some guidance on tests here. Do we need tests for other file formats here as well? Also unsure about naming conventions.

The primary bug in this case (apart from the indentation problem in the original code) is that the code was ignoring cases where the XML of the cell had multiple child nodes that were not spaces

reverted implementation from #33233 incorporating part of its modifications.

super().__init__(filepath_or_buffer, storage_options=storage_options)

def _monkeypatch_odf_element_str(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa, why don't you just change the impl to be correct? we never want to monkeypatch things in library code.

Copy link
Contributor Author

@asishm asishm Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should not be monkeypatching the library code and ideally this should be fixed in odfpy directly.

If you look at the fix it essentially does the exact thing that the library does https://github.com/eea/odfpy/blob/master/odf/element.py#L240-L244 except add the case that PR #33233 addresses. This is relevant mostly because these Nodes can get specially if the nested child nodes also contain multiple spaces that #33233 addresses. Then it would become a recursive implementation.

For a cell like this:
image
there is no way to get all the spaces without recursive (or equivalent). The original PR is also somewhat of a monkeypatch as it essentially replaced str(cell) with _get_cell_string_value(cell)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right why can't you just change _get_string_cell_value to what you are doing here? (and put a nice comment / link to the source & issue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. done

not sure why the tests are failing. they seem unrelated.

@jreback jreback added the IO Excel read_excel, to_excel label Sep 7, 2020
@WillAyd
Copy link
Member

WillAyd commented Sep 9, 2020

Looks good - can you add a what’s new note for 1.2 bug fix section?

@asishm
Copy link
Contributor Author

asishm commented Sep 9, 2020

@WillAyd added what's new note

# https://github.com/pandas-dev/pandas/pull/36175#discussion_r484639704
value.append(self._get_cell_string_value(fragment))
else:
value.append(str(fragment))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to call str(fragment) instead of always just calling self._get_cell_string_value(fragment)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we'd need to handle the str method for all the different cases which would require going into the spec. odfpy already has the str implementations done. the only case that was problematic was the multiple spaces that was addressed in the previous linked PR.

value.append(" " * spaces)
else:
# recursive impl needed in case of nested fragments
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reader it isn't clear to me what this means; is this a bug that needs to be fixed upstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before #33233, the value was str(cell) which relied on the upstream odfpy implementation.

#33233 attempted to fix cases where multiple spaces was getting skipped in the output. That is because odfpy's str implementations for Nodes/Elements that are of that specific type (from odf.text import S) does not include the actual number of spaces. This imo should have been addressed upstream.

#33233 also introduced the bug that is causing the current UnboundLocalError as one line is misaligned. But fixing the misalignment adds in more bugs. With just the indentation fix, the output will still be missing certain fragments from the cell (for the file in #36122).

This was my original reason for doing a monkeypatch that patched __str__ implementation of the Element nodes to include the S spaces Node. I changed the monkeypatch but instead changed the implementation of _get_cell_string_value to have the same behavior as the monkeypatch. It still relies on odfpy's str implementation for all other Element types except for the Space Node, otherwise it pretty much mirrors Element.__str__ from odfpy.

Sorry for the wall of text.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, but lgtm.

@@ -296,6 +296,7 @@ I/O
- :meth:`to_csv` did not support zip compression for binary file object not having a filename (:issue: `35058`)
- :meth:`to_csv` and :meth:`read_csv` did not honor `compression` and `encoding` for path-like objects that are internally converted to file-like objects (:issue:`35677`, :issue:`26124`, and :issue:`32392`)
- :meth:`to_picke` and :meth:`read_pickle` did not support compression for file-objects (:issue:`26237`, :issue:`29054`, and :issue:`29570`)
- Bug in :meth:`read_excel` with `engine="odf"` caused UnboundLocalError in some cases where cells had nested child nodes (:issue:`36122`, and :issue:`35802`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double back ticks on UnboundLocalError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback jreback added this to the 1.2 milestone Sep 13, 2020
@jreback jreback merged commit 88bc2e4 into pandas-dev:master Sep 13, 2020
@jreback
Copy link
Contributor

jreback commented Sep 13, 2020

thanks @asishm

@simonjayhawkins
Copy link
Member

from #36122 (comment)

This used to work fine (one month ago on my machine). I tried upgrading to Pandas 1.1.1 and get the same bug. It looks like what's below:

could this fix go in 1.1.3

@jreback
Copy link
Contributor

jreback commented Sep 14, 2020

yep could backport

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Sep 14, 2020

ok will raise PR on master first to move release note and then do a manual backport.

can't auto backport this (followed by release note move) as no doc/source/whatsnew/v1.2.0.rst on 1.1.x

@simonjayhawkins
Copy link
Member

ok will raise PR on master first to move release note and then do a manual backport.

second thoughts will do the backport first and then if all ok move release note on master to be in sync

jreback pushed a commit that referenced this pull request Sep 14, 2020
…lError in certain cases (#36355)

Co-authored-by: Asish Mahapatra <[email protected]>
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Sep 14, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Sep 15, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Sep 15, 2020
jreback pushed a commit that referenced this pull request Sep 16, 2020
jreback pushed a commit that referenced this pull request Sep 16, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Sep 16, 2020
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request Sep 17, 2020
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request Sep 17, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
@asishm asishm deleted the ods branch January 5, 2021 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel
Projects
None yet
4 participants